Data Preparation
Data preparation means converting raw data into a format that a machine learning model can use.
Raw data is usually designed for business systems, logs, or analytics. It is not usually ready for model training.
For this beginner example, we will use mock GA4-style event data.
Mock Data Source
The sample data is generated by this script:
The generated file is:
ga4_mock_data.json
This file uses JSON Lines format.
That means:
one line = one JSON object = one event
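To make the format concrete, here is a minimal sketch that parses JSON Lines text by hand with the standard library. The two shortened events are illustrative stand-ins, not the actual contents of the mock file.

```python
import io
import json

# Two shortened, illustrative events in JSON Lines form: one JSON object per line.
jsonl_text = (
    '{"event_name": "session_start", "event_date": "20230101"}\n'
    '{"event_name": "view_item", "event_date": "20230101"}\n'
)

events = []
for line in io.StringIO(jsonl_text):
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    events.append(json.loads(line))  # each line parses on its own

print(len(events))               # 2
print(events[1]["event_name"])   # view_item
```

Because every line is a complete JSON object, the file can be read one event at a time without loading everything into memory.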
This page only uses the generated mock GA4 event data. Later lessons continue to use Python, pandas, and scikit-learn.
Partial Sample Data
The raw data looks like this (spread over several lines here for readability; in the file, each event occupies a single line):
{
  "event_date": "20230101",
  "event_timestamp": 1672574760000000,
  "event_name": "session_start",
  "user_pseudo_id": "f282d77c-aef5-4089-b73b-3d5b00c914f2",
  "event_params": [
    {
      "key": "ga_session_id",
      "value": {
        "int_value": 1780749951
      }
    },
    {
      "key": "page_title",
      "value": {
        "string_value": "Home Page"
      }
    }
  ]
}
Another event may contain product information:
{
  "event_date": "20230101",
  "event_timestamp": 1672574763973844,
  "event_name": "view_item",
  "user_pseudo_id": "f282d77c-aef5-4089-b73b-3d5b00c914f2",
  "event_params": [
    {
      "key": "ga_session_id",
      "value": {
        "int_value": 1780749951
      }
    },
    {
      "key": "page_title",
      "value": {
        "string_value": "Home Page"
      }
    },
    {
      "key": "item_id",
      "value": {
        "string_value": "SKU_4001"
      }
    },
    {
      "key": "price",
      "value": {
        "double_value": 45.0
      }
    }
  ]
}
The important idea is:
raw event data is nested
For beginner machine learning practice, we will flatten the useful fields into normal columns.
Load Data With pandas
Use pd.read_json with lines=True so that pandas parses each line of the file as a separate JSON object.
import pandas as pd
raw_events = pd.read_json("ga4_mock_data.json", lines=True)
Preview the raw data.
raw_events.head()
Check the columns.
raw_events.columns
Expected columns:
- event_date
- event_timestamp
- event_name
- user_pseudo_id
- event_params
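The column check can also be done in code, so a missing field fails loudly. The sketch below loads one shortened event from an in-memory string instead of the real file; the names sample_event, expected_columns, and missing_columns are illustrative.

```python
import io
import json

import pandas as pd

# One shortened event in the same shape as the samples on this page.
sample_event = {
    "event_date": "20230101",
    "event_timestamp": 1672574760000000,
    "event_name": "session_start",
    "user_pseudo_id": "f282d77c-aef5-4089-b73b-3d5b00c914f2",
    "event_params": [
        {"key": "ga_session_id", "value": {"int_value": 1780749951}},
    ],
}

# Stand-in for the file: a single JSON Lines record held in memory.
raw_events = pd.read_json(io.StringIO(json.dumps(sample_event) + "\n"), lines=True)

expected_columns = [
    "event_date",
    "event_timestamp",
    "event_name",
    "user_pseudo_id",
    "event_params",
]
missing_columns = [col for col in expected_columns if col not in raw_events.columns]
if missing_columns:
    raise ValueError(f"Missing columns: {missing_columns}")
```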
Flatten Event Parameters
The event_params column contains a list of key-value objects.
For example, ga_session_id, item_id, and price are inside event_params.
Create a helper function to extract one parameter.
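def get_event_param(event_params, target_key):
    """Return the value stored under target_key, or None if it is absent."""
    for param in event_params:
        if param["key"] != target_key:
            continue
        value = param["value"]
        # Each value holds exactly one typed field; return the one that exists.
        for value_type in ["int_value", "double_value", "string_value"]:
            if value_type in value:
                return value[value_type]
    return None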
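As a quick check of the helper, the sketch below repeats the function so it runs on its own and calls it on the event_params of the view_item sample shown earlier. The key "coupon" is a made-up name used only to show the missing-key case.

```python
def get_event_param(event_params, target_key):
    """Return the value stored under target_key, or None if it is absent."""
    for param in event_params:
        if param["key"] != target_key:
            continue
        value = param["value"]
        for value_type in ["int_value", "double_value", "string_value"]:
            if value_type in value:
                return value[value_type]
    return None


# event_params copied from the view_item sample on this page.
view_item_params = [
    {"key": "ga_session_id", "value": {"int_value": 1780749951}},
    {"key": "page_title", "value": {"string_value": "Home Page"}},
    {"key": "item_id", "value": {"string_value": "SKU_4001"}},
    {"key": "price", "value": {"double_value": 45.0}},
]

print(get_event_param(view_item_params, "item_id"))  # SKU_4001
print(get_event_param(view_item_params, "price"))    # 45.0
print(get_event_param(view_item_params, "coupon"))   # None ("coupon" is a hypothetical key)
```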
Create feature_events, a flatter DataFrame for beginner-friendly analysis.
# Work on a copy so raw_events stays unchanged.
feature_events = raw_events.copy()
# Cast to str first: read_json may infer values like "20230101" as integers.
feature_events["event_date"] = pd.to_datetime(
    feature_events["event_date"].astype(str),
    format="%Y%m%d",
)
feature_events["user_id"] = feature_events["user_pseudo_id"]
# Pull the nested parameters out into ordinary columns.
feature_events["session_id"] = feature_events["event_params"].apply(
    lambda params: get_event_param(params, "ga_session_id")
)
feature_events["item_id"] = feature_events["event_params"].apply(
    lambda params: get_event_param(params, "item_id")
)
feature_events["price"] = feature_events["event_params"].apply(
    lambda params: get_event_param(params, "price")
)
# Keep only the flat columns needed in later lessons.
feature_events = feature_events[
    [
        "event_date",
        "event_timestamp",
        "user_id",
        "session_id",
        "event_name",
        "item_id",
        "price",
    ]
]
Now feature_events is easier to use. The event_timestamp column is left out of this preview to save space, and empty cells are missing values:
| event_date | user_id | session_id | event_name | item_id | price |
|---|---|---|---|---|---|
| 2023-01-01 | user uuid | 1780749951 | session_start | | |
| 2023-01-01 | user uuid | 1780749951 | view_item | SKU_4001 | 45.0 |
This is still event-level data.
For machine learning, we will later convert it into user-level data.
many events -> one row per user
Basic Cleaning
Check the first few rows.
feature_events.head()
Check column types.
feature_events.dtypes
Check missing values.
feature_events.isna().sum()
Missing values are not always bad.
For example, session_start may not have item_id or price. That is normal because no product was viewed yet.
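One way to tell expected gaps from real problems is to look at missing values per event type. The sketch below uses a small hand-made table in place of the real feature_events; the rows are illustrative.

```python
import pandas as pd

# Hand-made stand-in for feature_events (illustrative rows only).
feature_events = pd.DataFrame(
    {
        "event_name": ["session_start", "view_item", "session_start", "view_item"],
        "item_id": [None, "SKU_4001", None, "SKU_4001"],
        "price": [None, 45.0, None, 45.0],
    }
)

# Share of missing values in each column, broken down by event type.
missing_share = (
    feature_events[["item_id", "price"]]
    .isna()
    .groupby(feature_events["event_name"])
    .mean()
)
print(missing_share)
```

A share of 1.0 for session_start is expected here. A non-zero share for view_item would be worth investigating, because that event should carry product fields.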
Define The Problem First
Before writing more code, define the machine learning problem clearly.
For this lesson, the problem is:
Predict whether a user makes a purchase, using simple summaries of that user's events.
This is a simplified classroom example.
In real projects, we usually need time windows: features are built from events before a cutoff date, and the label comes from events after it. For now, we skip that because the first goal is to understand the basic data shape.
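Although this lesson skips time windows, a rough sketch may help show what the idea means in practice. Everything here is an assumption made for illustration: the rows, the cutoff date, and the event name "purchase", which does not appear in the samples on this page.

```python
import pandas as pd

# Hypothetical event-level rows; "purchase" is an assumed event name.
events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2", "u3"],
        "event_date": pd.to_datetime(
            ["2023-01-01", "2023-01-09", "2023-01-02", "2023-01-03", "2023-01-10"]
        ),
        "event_name": ["view_item", "purchase", "view_item", "view_item", "purchase"],
    }
)

cutoff = pd.Timestamp("2023-01-08")  # assumed boundary between the two windows

observation = events[events["event_date"] < cutoff]  # earlier events: features
outcome = events[events["event_date"] >= cutoff]     # later events: label

# One row per user seen in the observation window.
features = observation.groupby("user_id").size().rename("event_count").to_frame()

# Label = 1 if the user purchased in the outcome window.
buyers = set(outcome.loc[outcome["event_name"] == "purchase", "user_id"])
features["label"] = features.index.isin(buyers).astype(int)
print(features)
```

Keeping the two windows separate means that features never include information from the period the label describes.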
Unit Of Prediction
The unit of prediction answers:
What does one row mean?
For this example:
one row = one user
This is important because the model needs a stable meaning for each row.
Bad design:
one row = one event
This would make the label confusing: the same user would appear in many rows, so it is unclear which row should carry the purchase label.
Better design:
one row = one user with summarized behavior
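As a preview of that shape, the sketch below collapses a few hand-made events into one row per user. The column names event_count, session_count, and has_purchase are illustrative choices, and "purchase" is an assumed event name. The next lesson builds the real feature table.

```python
import pandas as pd

# Hand-made stand-in for feature_events (illustrative rows only).
feature_events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "session_id": [11, 11, 12, 21],
        "event_name": ["session_start", "view_item", "purchase", "session_start"],
    }
)

# Many events -> one row per user with summarized behavior.
user_table = feature_events.groupby("user_id").agg(
    event_count=("event_name", "count"),
    session_count=("session_id", "nunique"),
    has_purchase=("event_name", lambda names: int((names == "purchase").any())),
)
print(user_table)
```

Each user now appears exactly once, so the label has a single, unambiguous place to live.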
Output For Feature Engineering
For this beginner example, we do not split events by time window.
The output of this lesson is:
feature_events
The next lesson will use feature_events to build a user-level feature table.
Basic Data Checks
Check event counts.
feature_events["event_name"].value_counts()
Check date range.
feature_events["event_date"].min(), feature_events["event_date"].max()
Check user count.
feature_events["user_id"].nunique()
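The date range above has day-level precision. For a finer check, event_timestamp can be converted to a datetime. This sketch assumes the values are microseconds since the Unix epoch, which matches the 16-digit values in the samples on this page.

```python
import pandas as pd

# The two event_timestamp values from the samples on this page.
event_timestamps = pd.Series([1672574760000000, 1672574763973844])

# Assumption: the integers count microseconds since 1970-01-01 UTC.
event_times = pd.to_datetime(event_timestamps, unit="us", utc=True)
time_span = event_times.max() - event_times.min()

print(event_times.iloc[0])  # 2023-01-01 12:06:00+00:00
print(time_span)
```

The converted date agrees with the event_date value 20230101, which is a useful sanity check on the assumed unit.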
Data Preparation Checklist
Before moving to feature engineering, confirm:
- The business question is clear
- The prediction target is clear
- One row has a clear meaning
- Raw data has enough rows
- Raw data has the events needed to create features and labels
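The last two items can be partly automated. The helper below is a hypothetical sketch, not part of the lesson code: check_event_data, min_rows, and the required event names are assumptions to adapt to your own data.

```python
import pandas as pd


def check_event_data(events, required_events, min_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(events) < min_rows:
        problems.append(f"only {len(events)} rows, expected at least {min_rows}")
    present = set(events["event_name"].unique())
    missing = sorted(set(required_events) - present)
    if missing:
        problems.append(f"missing event types: {missing}")
    return problems


# Hand-made stand-in for feature_events; "purchase" is an assumed event name.
feature_events = pd.DataFrame(
    {"event_name": ["session_start", "view_item", "view_item"]}
)
problems = check_event_data(
    feature_events,
    required_events=["view_item", "purchase"],
    min_rows=2,
)
print(problems)  # ["missing event types: ['purchase']"]
```

The first three checklist items are design decisions, so they still need to be answered by a person rather than by code.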
Good machine learning starts before model training. It starts with a clear data design.